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Abstract 



Increasing interest in applying statistical learning methods to functional magnetic resonance imaging 
(fMRI) data has led to growing application of off-the-shelf classification and regression methods. These 
methods allow investigators to use standard software packages to accurately decode perceptual, cognitive, or 
behavioral states from distributed patterns of neural activity. However, such models are generally difficult 
to interpret when fit to whole-brain fMRI data, as they suffer from coefficient instability, are sensitive to 
outliers, and typically return dense rather than parsimonious solutions. Here, we develop several variants 
of the the graph-constrained elastic net (GraphNet)-a fast, whole-brain regression and classification method 



developed for spatial and temporally correlated data that yields interpretable model parameters ( Grosenick 



et al. 2009). GraphNet models yield interpretable solutions by incorporating sparse graph priors based on 



model smoothness or connectivity as well as a global sparsity inducing prior that automatically selects im- 
portant variables. Because these methods can be fit to large data sets efficiently and can improve accuracy 
relative to volume-of-interest (VOI) methods, they allow investigators to take a more objective approach to 
model fitting — avoiding the selection bias associated with VOI analyses or the multiple comparison problems 
of roaming VOI ("searchlight") methods. As fMRI data are unlikely to be Gaussian, we (1) extend GraphNet 
to include robust loss functions that confer insensitivity to outliers, (2) equip them with "adaptive" penalties 
that have oracle properties, and (3) develop a novel sparse structured support vector machine (SVM) Graph- 
Net classifier to allow maximum-margin classification with GraphNet models. When applied to previously 
published data (Knutson et al. [2007 1, these efficient whole-brain methods significantly improved classifica- 
tion accuracy over previously reported VOI-based analyses ( Knutson et al.|2007 Grosenick et al.|2008 l while 
discovering task-related regions not included in the original VOI approach. 
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1. Introduction 



In functional magnetic resonance imaging (fMRI) studies, investigators measure neural "activity" with 
the blood oxygen-level dependent (BOLD) signal, and relate this activity to psychophysical or psychological 
variables of interest. Historically, modeling is performed one voxel at a time to yield a map of univariate 
statistics that are then thresholded according to some heuristic to yield a "brain map" suitable for visual 
inspection. Over the past decade, however, there has been a growing number of neuroimaging studies 
applying statistical learning to fMRI data in order to model effects across multiple voxels. Commonly referred 
to as "multivariate pattern analysis" (Hanke et al. 2009 1 or "decoding" — to distinguish them from more 
commonly-used mass-univariate methods ( Friston et al.||1995 l — these approachs have enabled investigators 
to use patterns of activation across multiple voxels to classify categories of images during subject viewing 
( Peelen et al.|2009 Shinkareva et al.|2008 1, categories of images during memory retrieval ( Polyn et al.|2005 l, 
intentions to act ( Haynes et al.|2007[) , and intentions to purchase ( Grosenick et al.|2008 l, to name just a few 



applications (see also Norman et al. 2006 Haynes and Rees 2006 Pereira et al. 2009 O'Toole et al. 2007 



Bray et al.|2009 and the many examples in Neurolmage Volume 56 Issue 2). In a variety of cases, statistical 



learning algorithms have increased model sensitivity relative to standard mass-univariate analyses (|Haynes 
and ReeslpOe) [Pereira et al.||2009| . 



Despite these advances, the application of statistical learning to neuroimaging data is still developing. 

[2^061, 



Most research applying statistical learning to fMRI has been conducted by a few labs (Norman et al. 



and it remains standard practice to use off-the-shelf classifiers (Norman et al. 2006 Pereira et al. 



2009J 



(but cf. Hutchinson et al. 2009 Chappell et al. 2009 Brodersen et al. 20111. These models are typically 



applied to volume of interest data within subjects instead of to whole-brain data across subjects (Etzel et al. 



2009 


Pereira et al.||2009|) 


2010 


Ryali et al.||2010 


1. 



they were not developed with whole-brain neuroimaging data in mind, and so suffer from inefficiencies 
when applied in the neuroimaging context. In particular, the large number of features (usually voxel data) 
and spatiotemporal autocorrelations characteristic of fMRI data present unique problems that off-the-shelf 
classifiers are not designed to solve. 

Indeed, in the machine learning literature, the purpose of off-the-shelf classifiers such as discriminant 
analysis (DA), naive Bayes (NB), k-nearest neighbors (kNN), and support vector machines (SVM) has been 
to quickly and easily yield high classification accuracy on tasks where accuracy itself is the primary objective- 
for example speech recognition or hand- written digit identification (Hastie et al. 2009). However, beyond 
accuracy, neuroscientists would like to make claims about what brain areas correlate with particular behav- 
iors or stimuli at particular points in time (or more precisely, which mental properties supervene on brain 
properties (Vogelstein et al. In Press l). This aim requires classification or regression methods that yield 
clearly interpretable sets of model coefficients. For this reason, recent literature on multivariate classifica- 
tion of fMRI data recommends that linear classifiers such as logistic regression (LR), linear discriminant 
analysis (LDA), Gaussian Naive Bayes (GNB), or linear SVM be preferred over nonlinear classifiers in fMRI 
applications ( Haynes and Rees||2006[ Pereira et al.||200^ l. 

However, it is a mistake to think that linearity alone will guarantee that a model will yield stable and 
interpretable coefficients. For instance, in the case of many correlated input variables, it is known that LR, 
LDA, and GNB all yield unstable coefficients and degenerate covariance estimates-particularly if applied to 
smoothed data (Hastie et al. 1995 20091. In the classification context, penalized least squares approaches 



tend to oversmooth the coefficients, complicating interpretation ( FriedmEiii||1997 1. Additionally, most linear 
classifiers return dense sets of coefficients (as in Figure 1, left panels) that require a threshold or a feature 
selection step in order to to yield a parsimonious model. Although some heuristic methods exist for coefficient 
selection, these are in general greedy (e.g., forward/backward stage-wise procedures like Recursive Feature 
Elimination ( Guyon and Vapnik||2002 De Martino et aL]|2007 Bray et ar]|2009 1) and can be unstable over 
resampling of the data (as they are easily stuck in local minima) (Hastie et al. 20091. While principled 
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Figure 1. Mid-sagittal and coronal plots of coefRcients from dense, sparse, and structured sparse models. Warm colored 
coefficients indicate a positive relationship with the target variable (here predicting the decision to buy a product, cool colors 
a negative relationship. Sparse models set many model coefficients to zero, while in dense models almost all coefficients are 
nonzero. Structured sparse models use a penalty on differences between selected voxels to impose a structure on the model fit 
so that it yields coefficients that are both sparse and structured (e.g., smooth). Log-histograms of the estimated voxel- wise 
coefficients show that the sparse model coefficients have a near Laplacian (double-exponential) distribuion, while the dense 
coefficients have a near Gaussian distribution. The structured sparse model coefficients are a product of these distributions 
(see also Figure 2). Coefficients penalties that yield each model type and examples of such models are given below each 
column. 
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methods exist for choosing a threshold for dense mass-univariate maps of coefficients (e.g. Random Field 
Theory ( Adler and Taylor||2000 Worsley et al.|[2004| ), such approaches are not currently available for dense 
multivariate regression and classification methods. 

Recently, sparse regression models have been applied to neuroimaging data to yield reduced sets of 
coefficients chosen automatically during the model fitting process. The first examples in the fMRI literature 
are ( Yamashita et al.]|2008 ), who apply sparse logistic regression (Tibshirani 1996) to classification of visual 
stimuli, and ( Grosenick et al.|2008 l who develop sparse penalized discriminant analysis to turn "elastic net" 
regression (Zou and Hastie 20051 into a classifier and apply it to choice prediction. Subsequently, sparse 

have 



models for regression (Carroll et al. 2009 Hanke et al. 20091 and classification (Hanke et al. 



2010 



2009) 



been applied to fMRI data to yield reduced sets of coefficients from VOIs, whole-brain volumes (Ryali et al. 
van Gerven et al. 20101 and whole brain volumes over multiple time points ( Grosenick et al.[2009[ 



20101). In general these methods use an £i-penalty (sum of absolute values) on the model coefficients, which 
results in many of the estimated coefficients being set to zero (see Figure 1, leftmost panels, and Figure 2b). 
When applied without modification to correlated fMRI data, however, ^i-penalized models can select an 
overly sparse solution-resulting in omission of relevant features and yielding unstable coefficient estimates 

To allow relevant but correlated 



(2008). 



over cross-validation folds Zou and Hastie (2005); Grosenick et al 
coefficients to coexist in a sparse model, recent approaches to fMRI regression ( Carroll et al.|p009 Li et al 
2009) and classification ( Grosenick et al.|2008[ 2009 Ryali et al.|2010 ) use a hybrid £i- and £2-norm penalty 



("elastic net" penalty Zou and Hastie (2005)) on the coefficients. These approaches allow correlated variables 
to be included together in a sparse model. 

We explore models that employ a modified version of the the elastic net penalty that includes a general 
user-specified sparse graph penalty. This penalty allows the user to efficiently incorporate physiological con- 
straints and prior information in a model. The resulting graph-constrained elastic net (GraphNet) regression 
( Grosenick et al.||2009 2010) can find "structured sparsity" in correlated data with many features (Figure 1, 
right panels) and is related to previous approaches in the manifold learning (Belkin et al. 2006) and gene 
microarray literatures ( Li and Li|2008 ). Related "sparse structured" methods in the statistics literature have 
recently been shown to have desirable convergence and variable selection properties for large correlated data 
(Slawski et al. 2010 Jenatton et al.||2011 ), and structured sparse coefficients can also be achieved within a 
Bayesian framework ( van Gerven et al.pOlO ). Here we extend the performance of GraphNet regression and 
classification methods on whole-brain fMRI data, making them robust to outliers (for both regression and 
classification), adding "adaptive" penalization to reduce model bias and improve model selection, and devel- 
oping a novel support vector GraphNet (SVGN) classifier. To allow these methods to be fit to whole-brain 



fMRI data over multiple time-points, we adapt results from the applied statistics literature ([Friedman and 
Tibshirani||2010 ) to fit them models to large data sets efficiently. 



After developing robust and adaptive GraphNet regression and classification methods, we demonstrate the 
utility of GraphNet classifiers on previously published data ( Knutson et al.|[2007 ). Specifically, we use them 
to predict subjects' purchasing behavior on a trial-to-trial basis using their whole-brain BOLD response 
over several time points at once, and make inferences about which brain regions discriminate upcoming 
decisions to purchase versus not purchase a product. Fitting the models to 25 subjects whole-brain data 
over 7 TRs yielded classification rates better than those found previously in a volume-of-interest (VOI) based 
classification analysis ( Grosenick et al.|2008 ) , and than those obtained with the linear support vector machine 
(SVM) classifier fit to the full data. Our whole-brain, spatiotemporal results confirm the relevance of the 
vols chosen in the previous studies (bilateral nucleus accumbens, medial prefrontal cortex, and insula), 
but also implicate previously unchosen areas (notably bilateral ventral tegmental area (VTA) and posterior 
cingulate) . We conclude with a discussion of how to interpret GraphNet model coefficients appropriately, as 
well as future improvements, applications, and extensions of these models. 



2. Methods 

2. 1 . Preliminaries 

2.1.1. Penalized least squares regression 
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Many classification and regression problems can be formulated as modeling a response vector y £ M" as 
a function of data matrix X G K"^^', which consists of n observations each of length p. In particular, a large 
number of models treat y as a linear combination of the predictors in the presence of noise e G K", such that 

y = Xp + e, (1) 

where e is a noise term typically assumed to be normally distributed e ~ A/'(0, cr^) with mean and variance 
In this case using squared error loss leads to the well-known ordinary least squares (OLS) solution 

P = argmin \\v - X(i\\l = (X^Xy^X^y, (2) 





which yields the best linear unbiased estimator (BLUE) if the columns of X are uncorrelated (Lehmann and 
Casella|[T998l ). 



However, this estimator is inefficient in general for p > 2 — it is dominated by biased estimators (Stein 



1956 ) — and if the columns of X are correlated (i.e. are "multicoUinear") then the estimated coefficient values 
can vary erratically with small changes in the data and the OLS fit can be quite poor. A common solution to 
this problem is penalized (or "regularized") least squares regression ( jTikhonov|1943 1, in which the magnitude 



of the model coefficients are penalized in order to stabilize them. This is accomplished by adding a penalty 
term 7^(/3) on the coefficient vector yielding 

I = argmin \\y - Xp\\l + XP{I3), A e M+, (3) 

where A is a parameter that trades off fit with sparsity (variance with bias) and M+ is the set of nonnegative 
scalars. These estimates are equivalent to maximimum a posteriori (M AP) estimates from a Bayesian 
perspective (with a Gaussian prior on the coefficients if V{fi) — ||/3||2 ("Hastie et al. 2009 1), or to the 



Lagrangian relaxation of a constrained bi-criterion optimization problem ( |Boyd and Vandenberghe 2004 1 



Such equivalencies motivate various interpretations of the model coefficients and parameter A (see section 
2.3). 



2.1.2. Sparse regression and automatic variable selection 



There are a few standard choices for penalty V{I3). Letting 'P{I3) — WPW^ = X^^^i (^^e ^2-n orm) 
gives the now classical "ri dge" regression estimates originally proposed for such problems ( [Tikhonov 1943[ 
' " ' |i = X;Li Ift I (the "^1 norm")-called the 



Hoerl and Kennard 



19701. More recently the choice V{l3) 



Least Absolute Shrinkage and Selection Operator (or "lasso") penalty in the regression context (Tibshirani 



1996 )-has become exceedingly popular in statistics, engineering, and computer science, leading some to call 
i-regression the "modern least squares" (Candes et al. 2008). In addition to shrinking the coefficient 

many are 



such 



estimates, the lasso performs variable selection by producing sparse coefficient estimates (i.e 
exactly equal to zero. Figure 1 left panels). In many applications having a sparse vector /3 is highly desirable; 
a model with fewer non-zero coefficients is simpler, and can help identify and emphasize which predictors 
have an important relationship with the response variable y. 

The £i norm used in the lasso is the closest convex relaxation of the €o pseudo-norm ||/3||o = If/JjT^ojj 
where l{^^^o} is an indicator function that is 1 if the jth coefficient /3j is nonzero and otherwise. This 
is just a penalty on the number of nonzero coefficients, and corresponds to finding the sparsest possible 
model. However, finding such a solution involves a combinatorial search through possible models (a form of 
"all subsets regression" (Hastie et al. 2009)) and so is computationally infeasible for even a modestly large 
number of input features. An ^i-norm penalty can be used as a heuristic that results in coefficient sparsity 
(in the maximum a posteriori (MAP) estimates — for a fully Bayesian approach see ( van Gerven et al.|2010 )). 
Such £i-penalized regression methods set many variables equal to zero and automatically select only a small 
subset of relevant variables to assign nonzero coefficients. While these methods yield the sparsest possible 
model in many cases ( Candes et al.|2003 Donoho||2006 ), they do not always do so, and reweighted methods 
such as Automatic Relevance Determination (Wipf and Nagarajan 20081 and iterative reweighting of the 
£i-penalty Candes et al. (2008) exist for finding sparser estimates. 
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Figure 2. (a) Diagrammatic representation of squared-error (red), Huber (black), and Huberized Hinge (green) loss functions. 
Dotted lines denote where the Huber loss changes from penalizing residuals quadratically (where |j/ — J/| < <5) to penalizing 
them linearly (where |?; — y| > S). The linear penalty on large residuals makes the Huber loss robust, (b) Diagrammatic 
representation of convex penalty functions used in this article (along one coordinate /3). The red curve is a quadratic penalty 
7'(/3) = on coefficient magnitude, often called the "ridge" penalty in regression. The blue curve is the LASSO penalty on 
coefficient magnitude 'P(/3) = |/3|. The purple curve is a convex combination of the read and blue curves: + (1 — a)|/3| 
(where here a = 0.5), called the "Elastic Net" penalty. The inset shows the prior distribution on the coefficient estimates that 
each of these penalties corresponds to: Gaussian (red), Laplacian (blue), and mixed Gaussian and Laplacian (purple). 



Despite offering a sparse solution and automatic variable selection, there are several disadvantages to 
using ^1— penalized methods lifce the lasso in practice. For example, if there is a group of highly correlated 
predictors the lasso will tvpicallv se lect a subset of "representative" predictors from this group to include 
in the model (Zou and Hastie 20051. This can make it difficult to interpret /3 because a coefficient being 
doesn't necessarily mean the corresponding predictor isn't useful for modeling y (i.e., false negatives are 
likely). Worse, entirely different subsets of the coefficients may be selected over resamplings of the data 



(during cross-validation, for example). Moreover, the lasso can select at most n non-zero coefficients(Zou 



and Hastie 2005 1 , which may be undesirable in problems where the number of input features p is much 
larger than the number of observations n ("p >> n" problems). Finally, as a global shrinkage method it 



biases model coefficients towards zero ( Tibshiraiii||1996 Hastie et al.||2009 l, making it difficult to interpret 
their magnitude in terms of original data units. Other models based on only an ^i-penalty, such as sparse 



logistic/multinomial regression and sparse SVM (Hastie et al. 



In response to several of these concerns (Zou and Hastie 



2009 1 suffer from the same problems. 



20051 proposed the elastic net, which uses a 



mixture of ii and £2 regularization, and may be written 



P = K argmin 



\\y-Xp\\l + \,m\, + \2\ 



(4) 



This elastic net estimator overcomes several (though not all) of the disadvantages discussed above, while 
maintaining many advantages of ridge regression and the lasso. In particular, the elastic net accomodates 
groups of correlated variables in the model and can select up to p variables with non-zero coefficients. The 
amount of sparsity in the solution vector can be tuned by adjusting the penalty coefficients Ai and A2. In 
this case, the £1 penalty can be thought of as a heuristic for enforcing sparsity, while the £2 penalty permits 
correlated variables in the model and stabilized the sample covariance estimate. This approach has been 



2008 



shown to perform well on £MRI data in both regression and classification settings ( Grosenick et al. 

Carroll et al.||200"9 Ryali et al.||2010 1 . The factor k = 1 + A2 in ([4]) and subsequent equations is a rescaling 
factor discussed in further detail below. 
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2.1.3. Optimal Scoring (OS) and Sparse Penalized Discriminant Analysis (SPDA) 



Sparse regression methods Hke the lasso or elastic net can be turned into sparse classifiers fLeng"2008j 
Grosenick et al. 2008 Clemmensen et al. In Press I. Naively, we might imagine performing a two-class 



classification simply by running a regression with lasso or the elastic net on a target vector containing I's 
and O's depending on the class of each observation yi S {0, 1}. We would then take the predicted values 
from the regression y and classify to if the ith. estimate yi < 0.5 and to 1 if the estimate yi > 0.5 (for 
example). In the multi-class case (i.e. J classes with J > 2), multi-response linear regression could be used 
as a classifier in a similar way. This would be done by constructing an indicator response matrix Y , with n 
rows and J columns (where again n is the number of observations and J is the number of classes) . Then the 
ith row of Y has a 1 in the jth column if the observation is in the jth class and Os otherwise. If we run a 
multiple linear regression of Y on our predictors X, we can the results for a one-against-all classification by 
classifying the ith observation to the class having the largest fitted value Yn, Yi2, Yij. With the exception 
of binary classification on balanced data, this classifier has several disadvantages, including that the estimates 
Yij are not probabilities, and that in the multi-class case certain classes can be "masked" by others, resulting 



in decreased classification accuracy (Hastie et al. 20091. However, it has been known for some time that 



applying LDA to the fitted values of such a multiple linear regression classifier is mathematically equivalent 
to fitting the full LDA model f Breiman and Ihaka||1984 1, yielding posterior probabilities for the classes and 
in some cases dramatically improving the classification performance over the original multivariate regression 



(Hastie et al. 1994 1995 2009). 



Hastie et al. 


19941 and ( 


Hastie et al. 


1995 



1995) exploit this fact and an equivalence between LDA and 
canonical correlation analysis to develop a procedure they call Optimal Scoring (OS). OS allows us to build 
a classifier by first fitting a multiple regression to Y using an arbitrary regression method, and then linearly 
transforming the fitted results of this regression using the OS procedure (see Appendix). This yields both 
class probability estimates and discriminant coordinates, and allows us to use any number of regression 
methods as discriminant classifiers. This approach is discussed in detail for nonlinear regression methods 



applied to a few input features in (Hastie et al. 1994), and for regularized regression methods applied to 



numerous (hundreds) of correlated input features in (Hastie et al. 1995). Here we extend the results of 
the latter work to include sparse structured regression methods that can be fit efficiently to hundreds of 
thousands of input features. 

More formally, OS finds an optimal scoring function 9 : g ^ R that maps classes g e {1, ... J} into the real 
numbers. In the case of a multi-class classification using the elastic net, we can apply OS to yield estimates 



(e,/3) 



K argmm 

e,0 



ire 



x/3\\i + \,m\ 



subject to n-^\\Ye\\l = 1, 



(5) 
(6) 



where O is a matrix that yields the optimal scores when applied to indicator matrix Y , and where we add the 
constr aint ([7| to av oid degenerate solutions ( Grosenick et al.|2008 l. Given that this is just a sparse version of 
PDA ( Hastie et al.||199^ l we have call this combination Sparse Penalized Discriminant Analysis (SPDA) (it 
has also been called Sparse Discriminant Analysis (SDA) ( Clemmensen et aL]|In Press )). For an interesting 
alternative approach for constructing sparse linear discriminant classifiers see Witten and Tibshirani| (In 
Press). 



2.1.4. Robust loss functions 

More generally, we can formulate the penalized regression problem we are interested in as minimizing the 
penalized empirical risk 7?.(/3) as a function of the coefficients, so that 



argmin TZ{l3) 



argmin £(y, y) 



XV if)), 



(7) 
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where y is our estimate of response variable y (note y = Xf) in the linear models we consider) and C{y,y) 
is a loss function that penalizes differences between the estimated and true values of y. For example, in the 
above models (|3|-(|5| we used C{y,y) = \\y — y\\2 — T^^=ihji~yiY ("squared error loss"). While squared error 
loss enjoys many desirable properties under the assumption of Gaussian noise, it is sensitive to the presence 
of outliers. 

Outlying data points are an important consideration when constructing models for fMRI data, in which 
a variety of factors from residual motion artifacts to field inhomogeneities can cause some observations to 
fall far from the sample mean. In the case of standard squared-error loss (as in equations ([2|-([5])), these 
outliers can have undue influence on the model due to the quadratically increasing penalty on the residuals 
(see Figure 2a). A standard solution in such cases is to use a robust loss function, such as the Huber loss 
( |Huber|[2009| 



N 



^H{y,y;^) = ^psiyi-m) (8) 

i=l 

(y»-yi)^/2 ioT\yi~%\<s 



where ps{yi - yi) 



This function penalizes residuals quadratically when they are less than or equal to parameter 6, and linearly 
when they are larger than S (Figure 2a) . A well specified 6 can thus significantly reduce the effects of large 
residuals (outliers) on the model, as they no longer have the leverage resulting from a quadratic penalty. 
As (5 — ;> oo (or practically when it becomes larger than the most outlying residual) we recover the standard 
squared-error loss. 

2.2. Graph- constrained Elastic Net (GraphNet) regression methods and a family of SPDA-based GraphNet 
classifiers 

So far we have seen that sparse regression methods like the elastic net, which use a hybrid £i- and ^2-norm 
penalty, can be used to fit sparse models that do not exclude correlated variables ( Zou and Hastie|2005 1, and 



that we can turn these regression models into classifiers that perform well when fit to VOI data (Grosenick 
et al. 112008). However, in a sense the elastic net penalty merely makes the model fitting procedure "blind" to 



the correlations between input features (by shrinking the sample estimate of the covariance matrix towards 
the identity matrix). Indeed, if we let A2 in equation ([S]) get very large, this model is equivalent to applying 
a threshold to mass-univariate OLS regression coefficients (i.e. our estimate of the covariance matrix is a 
scaled identity matrix) (Zou and Hastie,|2005 ) . 

In this section, we describe a modification of the elastic net that explicitly imposes structure on the 
model coefficients. This allows the analyst to pre-specify constraints on the model coefficients based on prior 
information, physiological constraints, or desirable model properties, and then tune how strongly the model 
must adhere to these constraints. As the user-specified constraints take the general form of an undirected 



graph, we have called this regression method the graph-constrained elastic net (GraphNet) ( Grosenick et al 



2009| [20101. 

The GraphNet model closely resembles the elastic net model, but with a modification to the ^2-norm 
penalty term: 

P = Kargmin ||y-X/3||2 + Ai||/3||i + AG||/3||^ (9) 

p p 

where ||/3||^ = /3^G^ = E E l^jG.A, 

3=1 fe=i 

where G is a sparse graph. Note that in the case where G = I, where / denotes the identity matrix, the 
GraphNet reduces back to the elastic net. Thus the elastic net is a special case of GraphNet and we can 
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replicate the effects of increasing an elastic net penalty by adding a scaled version of the identity matrix 
(Aa/Ac)/ to G (for Ac > 0). 



2.2.1. GraphNet with the graph Laplacian penalty 



The example we will use for the matrix G in the remainder of this paper is the graph Laplacian. If we 
take the coefficients /3 to be functions over the brain volume y G R'^ over time points T g M such that 
/3(a;, y, z, t), then we would like a penalty which penalizes roughness in the coefficients as measured by their 
derivatives over space and time, such as 



V,T 



df3 d(3 dp dp 
dx dy dz dt 



dx dy dz dt 



A/3 dx dy dz dt. 



(10) 



V.T 



Since we are sampling discretely, we use the discrete version of the Laplacian A, the matrix Laplacian L, as 
described previously (Hastie et al. 19951. This formulation generalizes well to arbitrary graph connectivity 
and is widely used in spectral clustering techniques ( jBelkin and Niyogi]|2008 1. 



In the case where G is the matrix Laplacian L, the graph penalty, ||/9||g) has the simple representation 



E (A 



where A is the set of index pairs for "adjacent" voxels (and where adjacency is defined by entries in L). 
When written in this way it is very easy to see how the graph penalty induces smoothness: it penalizes the 
size of the pairwise differences between adjacent coefficients. If the quadratic terms [Pi — /3j)^ were replaced 
by absolute deviations, \Pi — then the model would be an instance of the "fused LASSO" (Tibshirani 



et al. 2005 1 . There are two main reasons we might prefer the use of a quadratic penalty in the present 



application: 



The fused LASSO is related to Total Variation (TV) denoising (Rudin et al. 19921 and tends to set 
many of the pairwise differences Pi — Pj to zero, creating a sharp piecewise constant set of coefficients 
that lacks the smoothness often expected in fMRI data. 

There are significant algorithmic complications that can be avoided by choosing a differentiable penalty 
on the pairwise differences ( Tseng|20^H Friedman et al.|2007a I , speeding up model fitting considerably. 



For simplicity we consider only a local spatiotemporal smoothing penalty in the current study, although using 
more elaborate spatial/temporal coordinates would follow similar logic. The SPDA-GraphNet is defined as 



(e,/3) 



K argmin ||re - Xpg + X,\\p\\, + XaWPWl 



subject to n "'^||Y8||2=1. 



(11) 
(12) 



We note that in a multi-class classification problem, we would likely want to iteratively minimize over O and 
P (Clemmensen et ar||In Press I. 



2.2.2. Robust GraphNet 



Having developed GraphNet using squared-error loss, we can now modify it to include a robust penalty 
like the Huber loss defined above. Replacing the squared error loss function with the loss function ([8| we 
have 

P = K argmin Cniv, X P; S) + X,\\P\U + XaWPWl- (13) 

/3 
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We define the SPDA-RGN classifier similarly to that for the standard GraphNet Note, however, that 

the model now has an addition hyperparameter to either be fit (or assumed) : the value of S that determines 
where the loss function switches from the quadratic to the linear regime (Figure 2a). Further, the loss 
function on the residuals is no longer quadratic and therefore could lead to much slower convergence of our 
optimization method. We discuss a solution to this issue next. 



2.2.3. Infimal convolution for non- quadratic loss functions 



Convergence speed of subgradient methods such as coordinate-wise descent is greatly improved if the 
loss function has a quadradic form. Non-quadratic loss functions like the Huber and the Huberized-hinge 
loss functions are not quadradic as written in [8] and thus could take numerous iterations to converage for 
each coefficient — significantly increasing computation time. Indeed, these methods are not amenable to 
optimization using coordinate descent in general, as their objectives are written as conditional functions 
with different loss conditions parameterized by 5. However, we can circumvent these problems and extend 
the applicability of coordinate-wise descent methonds using a trick from convex analysis to rewrite these 
loss functions as quadradic forms in an augmented set of variables. This method, called infimal convolution 
( Rockafellar|| 1 970 ) , is defined as 



{fMnfg){x) :=inf{/(a;-2;)+.g(2;)|yeM"}, (14) 

y 

where / and g are two functions of a; € M^. In this way it is possible to rewrite the ith term in the the 
Huber loss function Q as the infimal convolution of the squared and absolute- value functions applied to 
the ith residual r^: 

P5(r,) = ((l/2)(.)2*i„f|-|)('^.)= inf a?/2 + '5|&.|, (15) 

ai-f-bi—Ti 

where ri — yi — {X^)i . This yields the augmented estimation problem 

aj= argmin (l/2)||t/ -Xp-a\\l + XaP'^Gp + S\\a\\i + Ai||/3||i, (16) 



which we can rewrite as 



p+n 

argmin (1/2) ||y - Z-fWl + Xal'^G'^ + ^ w,\jj\ (17) 
^ j=i 



Z=[X /„x„], 7-[/3 a 
G' 




G Oixn 
On X 1 On X n 



where 5'™^™ is the set of positive semidefinite m x m matrices. This is just a GraphNet problem in an 
augmented set of p -I- n variables, and so can be solved using the fast coordinate-wise descent methods 
discussed in section 2.4 below. After solving for augmented coefficients 7 we can simply discard the last n 
coefficients to yield /3. A similar approach can be taken with the hinge-loss of a support vector machine 
classifier (as we show next), or more generally with any loss function decomposable into an infimal 
convolution of convex functions (see Appendix) . 



2.2.4. Support Vector Machine (SVM) GraphNet for classification 

In the n << p classification problem, maximum-margin classifiers such as the support vector machine 
often perform exceedingly well in terms of classification accuracy, but do not yield readily interpretable 
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coefficients. For this reason we also develop a sparse SVM with graph constraints (SVM-GraphNet) related 
to the "Hybrid Huberized SVM" of (iWang et al.||2008|) as an alternative to the SPDA method. Using a 



'Huberized-hinge' loss function on the model residuals we have 

P = K argmin Chh{v'^XP-5) + Ai||/3||i + Xg\\P\\1, 



(18) 



where y e { — 1,1}. Letting y = Xj3 he our estimates of the target variable, 



t^HH {v,y;5) 



where 



N 

E 

2 = 1 



ps{vi,yi) 




(19) 



for 1 - J < yijji < 1 
for ySi < 1 - (5 
for y^yi > 1, 



which is the Huberized-hinge loss of CWang et al.| 2008 1 . As with the Huber loss there is an additional 



hyperparameter 6 to be fit or assumed. In this case S determines where the hinge-loss function switches from 
the quadratic to the linear regime (see Figure 2a). This problem's loss function can also be written using 
infimal convolution to yield a more convenient quadratic objective term (see Appendix). 



2.2.5. Adaptive GraphNet models 



Automatic variable selection using the methods above is based on an approach that shrinks coefficient 
estimates towards zero, resulting in downwardly biased coefficient magnitudes. This (a) makes it difficult 
to interpret coefficient magnitude in terms of original data units, and (b) severely restricts the conditions 
under which the lasso can perform consistent variable selection ( Zou||200^ |. Ideally our model would select 
the correct parsimonious set of features (the "true model", were it known) , but avoid shrinking the nonzero 
coefhcients that are kept in the model (unbiased estimation) — desiderata known as the "oracle" property 
( [Fan and Li]|2001[ ). 

Several estimators possessing the oracle property (given certain conditions on the data) have been reported 
in the literature, including the adaptive lasso ( Zou|200B Zhou et al.|2011 1 and the adaptive elastic net (Zou 
and Zhang 20091. These estimators are simple modifications of penalized linear models. They work by 



starting with some initial estimates of the coefficients and use these to adaptively reweight the penalty on 
each individual coefficient /3j. We rewrite the robust GraphNet to have an adaptive penalty (the robust 
adaptive GraphNet) as follows 



Wi 



p 



-^RAGN ^ argmin /:h(2/,^/3;<^) + AiV%|/3,| + Ag||/3|| 

-7 



/3. 



(20) 
(21) 



The idea here is that important coefficients will have large starting estimates f3j (where /3j is an unbiased 
estimator of /3j) and therefore they will be shrunk at a rate inversely proportional to their unbiased starting 
estimates, leaving them unbiased in the final model. Coefhcients with small starting estimates (3j will 
experience additional shrinking and will likely be excluded from the model. We let 7 = 1 as used previously 
in the finite sam ple case (Zou 2006 Zou and Zhang||2009 ), and by analogy to the adaptive elastic net (Zou 
and Zhang 2009 1 use /3 = jS^'^^ for some fixed values of and 5 (chosen based on the robust GraphNet 



performance at those values). 

It is important to note that the oracle properties that hold in the asymptotic case may not apply to 
our finite sample, p » n situation. Nevertheless we include these models for comparison as such oracle 
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properties are desirable and there is evidence suggesting that because the adaptive elastic-net deals well 
with coUinearity arising from having many correlated inputs, it has improved finite sample performance for 
n » p problems ( jZou and Zhang|[2009 1 . 

Finally, we discuss a heuristic alternative to adaptive methods for adjusting nonzero coefficient magni- 
tudes to be on the scale of the original data. This approach can be used with any of the above models. 



2.2.6. Coefficient reseating 



The elastic net was originally formulated by (Zou and Hastie 2005| in both a "naive" and a rescaled 



form. The authors note that a combination of an li and £2 penalty can "double shrink" the coefficients. To 



correct this they propose rescaling the "naive" solution by a factor of k = 1 + A2 (Zou and Hastie 20051. 
Heuristically, the aim is to retain the desirable variable selection properties of the elastic net while rescaling 
the coefficients to be closer to the original scale. However, it is not clear that k = 1 + A2 is the correct 
multiplicative factor if the data are coUinear, and this can complicate the problem of choosing a final set of 
coefficients for a model. 

A simple alternative is to fit the elastic net, generating a fitted response y, and then to regress y on y. 
In particular, solving the simple linear regression problem 

y = ny — kX/3, k G M, 

yields an estimate k that can be used to rescale the coefficients obtained from fitting the Elastic Net (Daniela 
Witten and Robert Tibshirani, personal communication). The intuitive motivation is that this should 
produce the k which puts /3 and y on a reasonable scale for modeling y. 

Besides its simplicity, the principal advantage of this approach is that is requires no analytical knowledge 
about the amount of shrinkage that occurs as A2 is increased. This is particularly appealing because the 
same strategy of regressing y on y can be used with more general problems, such as the GraphNet. For 
example, the "double shrinking" caused by the graph penalty can be corrected in this manner. 

Finally, we describe one important exception in the binary classification case with balanced data that 
has been centered. If we are fitting a linear model of the form Y = X/3, but we have recentered the data 
such that we fit 

then the model for Y is 

and the intercept we are implicity fitting is given by ^^jXjPj . The balancing of trials ensures that Y 

is always zero. This is very convenient, for if we consider a new model /3 — kI3 we don't need to refit the 
intercept, and the predictions become 

Y'^K^X,f3,^Y.^=^^ 

This means that the balanced two-class classification problem on centered data does not depend on any 
scaling of parameters, even with the intercept term in the model. 



2.3. Interpreting GraphNet regression and discriminant models 



2.3.1. GraphNet as as constrained maximum likelihood problem 
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Given some observed fMRI data, we would like to maximize the likelihood of our model given the data, 
subject to some constraints on the model coefhcients-specifically that they are sparse and structured. One 
way of writing this goal more formally is 



maximize 



loglik(/3|X,y) 



subject to ||/3||i < ci 

\mG< ea- 



rn 

(23) 
(24) 



The particular log-likelihood function in (22) is determined by the distribution that we assume for the 
observation noise in equation ([T|). and the constraints ci € and cq G ^+ are hard bounds on the size 
of the coefficients in the £i and norms, respectively. Assuming identical and independently distributed 
(i.i.d.) additive Gaussian noise results in the squared-error loss function used in the standard GraphNet 
problem ([9]) , while a mixture of Gaussian and Laplacian noise yields the Huber loss ^ , with the particular 
mixture proportions set by the parameter 8 ( Huber|[2009 ) . 

If for desired values of the hard constraints ci and cq are somehow known beforehand, a globally optimal 



solution to (22l-(24l can be found using a projected subgradient method on the dual problem. In fMRI 



analysis, however, a priori values for ci and cq are rarely known in advance. Therefore, instead of solving 
the problem for fixed for ci and cg, we consider the Lagrangian relaxation. 



L(/3,Ai,Ag) = -loglik(/3|X,y) + Ai(||/3|| 



Ag(II/3||^ 



cg), Ai, Ag e 



(25) 



where the hard constraints ci and cq have been replaced by nonnegative linear penalties Ai and Ag (Boyd 
and Vandenberghe|2004 1. When this function is minimized with respect to the ci and cq terms disappear, 
so given particular values of Ai and Ag, we can equivalently solve the penalized maximum likelihood problem 



argmm 



-loglik(/3|X,y) + Ai||/3||i + XgWWI, Ai, Ag G M+, 



(26) 



characteristic of the GraphNet estimators — with the particular log-likelihood function determining which 
loss function C is used. This formulation leads to the following interpretations of the model parameters and 
model coefficients. 



2.3.2. Interpreting model parameters: dual variables as prices 



In the Lagrangian formulation, the 'dual variables' Ai and Ag express our (linear) irritation with any 
violation of the constraints. Since we solve problem |26| ci and cg are effectively zero, and we are penalized 
for any deviation of the coefficients from zero. We can thus interpret Ai and Ag as the prices we are willing to 
pay to improve our model likelihood at the expense of a less sparse or less structured solution, respectively. 

Examining the sensitivity of model fit to different values of Ai and Ag can therefore tell us about 
underlying structure in the data. For example, if the neural activity related to a particular cognitive task 
was very sparse and highly localized in a few uncorrelated voxels, then we should be willing to pay more 
for sparsity and less for smoothness (large Ai, small Ag). In contrast, if large smooth and correlated 
regions underlie the task then tolerating a large Ag could help our model fit substantially. To explore 
such possibilities, we can plot median test rates from cross validations at different combinations of model 
parameters. Figure 5 shows example contour plots of median rates as as function of Ai and Ag over the 
parameter grid on which we fit the GraphNet and robust GraphNet classifiers. 

2.3.3. Interpreting model coefficients 



Problem [26] can also be arrived at from a Bayesian perspective as a maximum a posteriori probability 
(MAP) estimator. In this case, the form of the penalty V{P) is related to our prior beliefs about the structure 
of the model coefficients. For example, under the well-known equivalence of penalized regression techniques 
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and posterior modes, the elastic net penalty corresponds to the prior 



PAi,A.(/3)(xexp{-(Ai||/?||i+A2||/3||^)}, 
( Zou and Hastie|[2005 ). The GraphNet penalty thus corresponds to the prior 

PAi,Ag(^) « exp{-(Ai||/3||i + AG^^G/3)} 

p ^ { 

cx J|exp{-Ai|/3j |} J|exp < -Ac ^ ftG„/3j 



(27) 



where i ^ j denotes that node i in the graph G is adjacent to node j. Therefore, the GraphNet problems are 
also equivalent to a MAP estimator of the coefficients with a prior consisting of a convex combination of a 
global Laplacian (double-exponential) and a local Markov Random Field (MRF) prior. In other words, they 
explicitly take into account our prior beliefs about the model coefficients being globally sparse but locally 
structured. 

It is important to note that because this is a binary classification task, the bias-variance relationship is 
more complicated than that found in the textbook squared-error loss case (where there is a simple additive 
relationship). In particular, due to a multiplicative interaction between bias and variance in the case of 
0-1 loss, models with negative decision boundary bias are insensitive to "over-smoothing" in the balanced 
sample case Friedman (19971. This is in contrast to regression or class probability estimation where "over- 



smoothing" degrades performance much more sharply. 

As discussed above, OS can be used to turn regression methods into discriminant classifiers. This allows 
us to use notions from regression like degrees of freedom, as well as those utilized in discriminant analysis, 
such as class visualization in the discriminant space using discriminant coordinates, or trial- by-trial posterior 



probabilities for individual observations ( jHastie et al. 1995 1. Having trial-by-trial probabilities of observations 
being in one class or another may prove particularly interesting the in functional imaging setting, where 
behavioral variables are available at the single trial level (e.g. reaction time, product preference). 



2-4-. Model fitting and computational considerations 



2.4-1- Coordinate-wise decent and active set methods 



Fitting regression models to whole-brain fMRI data demands efficient computational methods. This is 
particularly true if we would like to cross-validate the model fit over a grid of possible parameter values. 
In the application below (section 2.5), we have 26,630 input features (voxels) at each of 7 time points. 
Fitting the adaptive robust GraphNet using leave-one-subject out (LOSO) cross-validation (25 fits) for each 
realization of possible parameter values over a90x5x6xl0x3 grid of possible values {Ai, G, Xq, S, XI} 
requires 2,025,000 model fits on 1,882 observations of 186,410 input features. 

To make millions of model fits over hundreds of thousands of input features feasible, we formulated our 
minimization problems — equatio ns ([9l)([l3| and ( 18 1 — as coordinate- wise optimization problems (Tseng'2001") 
solved using active set methods (Friedman and Tibshirani 20101. In this formulation we fit one coefficient 
value at a time ("coordinate-wise" descent), holding the rest constant, and keep an "active set" of those 
nonzero coefficients that improve the fit enough to not being shrunk out of the model (to zero). We start 
with a large value of Ai, corresponding to all coefficients being zero, and then slowly decreasing Ai to allow 
more and more coefficients into the model. This allows us to consider an 'active set' of the model coefficients 
rather than all 186,410 inputs at each coordinate-wise update. Occasional sweeps though all the coefficients 
are made to look for new variables to include, as in ( ^Friedman and Tibshirani,,2010| ). We stop decreasing Ai 
before it reaches zero, as this would yield a dense solution that would both be computationally expensive 
and yield poor estimates ( Hastie et al.||200^ |. 
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In using coordinate-wise descent, we rely on the fact that our regression methods can be formulated as 



argmin /(^i, 



,l3p) = argmin g{(3i, ..,/3p) 



J = l 



(28) 



where (?(/3i, --j/^p) is a convex, differentiable function (e.g., squared-error and Huber loss plus our quadratic 
penalty ||/3||g)i ^^id where each /i(/3j) is a convex (but not necessarily differentiable) function (e.g., the l\ 
penalty) . If the convex, non-differentiable part of our penalty is separable in coordinates /3j — as is true of 
||/3||i = y ^^^i l/^ jl — then coordinate descent can be shown to converge to global solution of the minimization 
problem (Tseng 20011. In the case of Huber loss or Huberized-hinge loss, we first rewrite the two-part loss 
function as a single quadratic loss function using infimal convolution as described above. 

As a specific example of solving a GraphNet problem with coordinate descent, consider the coordinate- 
wise updates for the standard GraphNet problem given in equation ([9]). Letting y = X/3 + X.jPj (where 
X = X.^j is the matrix X with the jth column removed, and /3 = /3^j the coefficient vector with the jth 
coefficient removed), the gradient of the risk with respect to just the jth coefficient (holding the others fixed) 



X^ 



Xfi- 



X.TX.,P, + {X,/2)P^{G^,.)., + X2G,,I3, + (Ai/2)r(/3,) (29) 

0. Basic algebra 



where the subgradient r{/3j) = -1 if < 0, r{f3j) = 1 if > and T{/3j) e [-1, 1] if /3j 
then gives the coordinate update iterate for the jth coefficient estimate: 



S{x.j{y 



Xp)-{X2/2)P^iG^,.).„ Ai/2 



X^X, 



(30) 



where S{x,j) — sign(a;)(|x| — 7)+ is the soft-thresholding function (Friedman et al. 2007a). Note that if 
graph G = I, and the columns of X are standardized to have unit norm we recover the coordinate-wise 



Elastic Net update |der Kooij, (2007, ) ; [Friedman et al.| ( |2007b I . 



2.4-2. Computational complexity 



Looking more closely at equation 31 we see that if the variables are standardized (such that X.JX.j = 1) 
then the (c -|- l)st coefficient update for the jth coordinate can be rewritten 

N 

pf+^) ^ s I ^x,,rf) - (A2/2)^/?feGfe„ Ai/2 | /(I + A^G,,) (31) 

where r — y — y is the vector of residuals. Letting m be the number of off-diagonal nonzero entries in 
G and initializing with /Sj^"* — for all j and r*^°-' — y, the first sweep through all p coefficients will take 
0{pn) + 0{m) operations. Once oi variables are included in the active set, q iterations are performed 



according to (31 1 until the new estimates converge, at which point Ai is decreased incrementally and another 
0{pn) sweep made through the coefficients to find the next active set with 02 variables (using the previous 
estimate as a warm start to keep q small). This procedure is repeated for / values of Ai, until the model fit 
stops improving or a pre-specified coefficient density is reached. Let a — X]'=i denote the total number 
coefficients updates over all I fits. The total computational complexity is then 0{lpn) + 0{lm) + 0{aq). 
Thus if G is relatively sparse (so m is small) and if it requires few iterations for coefficients in an active set 
to converge (q small) — which is true if the unpenalized loss function is quadratic — then the computational 
complexity is dominated by the 0{lpn) term representing the sweep through the coefficients necessary to 
find the next active set for each new value of Ai. Either making G dense or decreasing Ai until a becomes 
large can cause the other complexity terms to play a significant role and slow the speed of the algorithm. 
For example, if G is dense than m — p^ — p and the Ollm) term will dominate. 
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Product Price Choice Fixate 




4s 4s 4s 2s 

T1 T2 T3 T4 T5 T6 T7 

Time (s) 

Figure 3. Save Holdings, or Purchase (SHOP) task trial structure. Subjects saw a labeled product (product period; 4 s, 2 
TRs), saw the product's price (price period; 4 s, 2TRs), and then chose either to purchase the product or not (by selecting 
either "y^s" or "no" presented randomly on the right or left side of the screen; choice period; 4 s, 2TRs), before fixating on a 
crosshair (2 s, ITR) prior to the onset of the next trial. 

2.4-3. Cross validation, classification accuracy, and parameter tuning 

For classification, data were down-sampled to ensure equal numbers of trials per class. This allowed 
us to conservatively assess significance of binary classification rates using the binomial distribution with 
a predicted 50% success rate. For 10-fold cross validation, this down-sampling occurred once so that the 
overall dataset was balanced, and then observations were assigned to folds. In the leave-one-subject-out 
cross-validation the down-sampling was done on a per-subject basis, so that it would be within folds (each 
subject's data is a test fold). We remark that the range for these grid values we chosen based off some 
preliminary fits, and that more efficient adaptive and analytical approaches to parameter search may be 
more effective. The grid values used are given in the Appendix. 

2.5. Application: Predicting buying behavior using fMRI (Knutson et al, 2007) 
2.5.1. Subjects and SHOP task 

Data from 25 healthy right-handed subjects were included in these analyses (one subject's original fMRI 
data could not be recovered and so was omitted). Along with the typical magnetic resonance exclusions (e.g., 
metal in the body), subjects were screened for psychotropic drugs and ibuprofen, substance abuse in the past 
month, and history of psychiatric disorders (DSM IV Axis I) prior to collecting informed consent. Subjects 
were paid $20.00 per hour for participating and also received $40.00 in cash to spend on products. In addition 
to the 25 subjects who were included in the analysis, 6 subjects who purchased fewer than four items per 
session (i.e., < 10%) were excluded due to insufficient data to model, and 8 subjects who moved excessive 
amounts (i.e., > 2 mm between whole brain acquisitions) were excluded. 

While being scanned, subjects participated in a "Save Holdings Or Purchase" (SHOP) Task (Figure 
3). During each task trial, subjects saw a labeled product (product period; 4 sec), saw the product's price 
(price period; 4 sec), and then chose either to purchase the product or not (by selecting either "yes" or "no" 
presented randomly on the right or left side of the screen; choice period; 4 sec), before fixating on a crosshair 
(2 sec) prior to the onset of the next trial (see Supplement 1 for illustration of the task layout). 

Each of 80 trials featured a different product. Products were pre-selected to have above-median at- 
tractiveness, as rated by a similar sample in a pilot study. While products ranged in retail price from 
$8.00-$80.00, the associated prices that subjects saw in the scanner were discounted down to 25% of retail 
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value to encourage purchasing. Therefore the cost of the products during the experiment ranged from $2.00 
to $20.00. Consistent with pilot findings, this led subjects to purchase 30% of the products on average, 
generating sufficient instances of purchasing to model. 

To ensure subjects' engagement in the task, two trials were randomly selected after scanning to count 
"for real". If subjects had chosen to purchase the product presented during the randomly selected trial, they 
paid the price that they had seen in the scanner from their $40.00 endowment and were shipped the product 
within two weeks. If not, subjects kept their $40.00 endowment. Based on these randomly drawn trials, 
seven of twenty-five subjects (28%) were actually shipped products. 

Subjects were instructed in the task and tested for comprehension prior to entering the scanner. During 
scanning, subjects chose from 40 items twice and then chose from a second set of 40 items twice (80 items 
total), with each set in the same pseudo random order (item sets were counterbalanced across subjects). 
We consider only data from the first time the item was presented here (see (Grosenick et al. 2008[ ) for a 



comparison between first and second presentations). After scanning, subjects rated each product in terms 
of how much they would like to own it and what percentage of the retail price they would be willing to pay 
for it. Then, two trials were randomly drawn to count "for real", and subjects received the outcome of each 
of the drawn trials. 



2.5.2. Image acquisition 

Functional images were acquired with a 1.5 T General Electric MRI scanner using a standard birdcage 
quadrature head coil. Twenty-four 4-mm-thick slices (in-plane resolution 3.75 X 3.75 mm, no gap) extended 
axially from the midpons to the top of the skull, providing whole brain coverage and adequate spatial 
resolution of subcortical regions of interest (e.g., midbrain, NAcc, OFC). Whole brain functional scans were 
acquired with a T2*-sensitive spiral in-/out- pulse sequence (TR=2 s, TE=40 ms, fiip=90), which minimizes 



signal dropout at the base of the brain (Glover and Law 20011. High-resolution structural scans were also 



acquired to facilitate localization and coregistration of functional data, using a Tl-weighted spoiled grass 
sequence (TR=100 ms, TE=7 ms, flip=90). 



2.5.3. Preprocessing 

After reconstruction, preprocessing was conducted using Analysis of Functional Neural Images (AFNI) 



software (Cox 1996). For all functional images, voxel time-series were sine interpolated to correct for non- 
simultaneous slice acquisition within each volume, concatenated across runs, corrected for motion, and 
normalized to percent signal change with respect to the voxel mean for the entire task. Given that spatial 
blur would artificially increase correlations between variables for the voxel-wise analysis, we used data with 
no spatial blur and a temporal high pass filter for all analyses. Note that in general smoothing before running 
analyses will compound the problems with correlation mentioned above, and result in rougher coefficients 
overall ( [Hastie et ar]|1995[ ) 



Spatiotemporal data were arranged as in previous spatiotemporal analyses ( Mourao-Miranda et al.|2007 1 



Specifically, data was arranged as an n x p data matrix X with n corresponding to the number of trial 
observations on the p input variables, each of which was a particular voxel at a particular time point. This 
yielded 26,630 voxels taken at 7 time points (each taken every 2 seconds), yielding a total of p = 186,410 
input input features per trial. Altogether, these data included n = 1, 882 trials across the 25 subjects. 

3. Results 

3.1. Classification rates 
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Table 1: Median accuracy and parameters for SPDA and SVM classifiers fit with Leave-5-Subjects-Out (L5SO) cross-validation. 



Model 



Training Test 



Sparse 



Correlated Structured Robust Adaptive 



98.75% 68.5% 
90.38% 72.5% 



Lasso^ 
Elastic Net^ 
GraphNet^ (GN) 
Robust GN (RGN) 
RGN + temporal 
Adaptive RGN 
ARGN + temporal 90.75% 73.5% 



86.86% 73.73% 

86.75% 74.5% 

96.5% 73.75% 

91.38% 73.75% 



GraphSVM 



85.3% 73.0% 



Ai = 33 
Ai = 54 
Ai = 68 
Ai = 43 
Ai = 42 
Ai = 50 
Ai = 40 
Ai = 120 



Aa = 10000 
As = 1000 
A2 = 100 
A2 = 1000 
A2 = 10000 
A2 = 1000 
A2 = 1000 



Ag = 100 

Ag = 100 0.3 

Ag = 10 (5 0.5 

Ag = 100 S = 0.4 A* = 0.01 

Ag = 100 (5 = 0.3 A^ = 0.01 

Ag = 10 S = 0.5 



Linear SYM"* 



97.94% 71.0% 



C = 3.8 X 10- 



^( |Tibshirani||l996[ ), ' ^Zou and Hastie||2005[ ), ^( [Grosenick et al.||2009| ), ^Cortes and Vapnik| ( |l995[ ). Chance level is 
50%. 



If models that use brain data to predict choice generalize well across subjects, it implies an invariance in 
the neural substrates of choice across individuals. We compared the abilities of the proposed SPDA models 
to predict eventual choices to purchase with a more commonly used linear SVM model, examining both 
generalization to held-out test sets from data pooled across subjects, and to test sets consisting of subjects 
held out entirely from the fitting procedure. Best results for the SPDA classifiers and linear SVM fit across 
the 25 subjects (and the model parameters for each) are given in Tables 1 and 2, as well as a summary of each 
model's properties. Models were fit using either leave-one-subject-out (LOSO, Table 1) leave-5-subjects-out 
(L5S0, Table 2) cross-validation, and we report both training and test results to examine to what extent 
the models overfit the training data. 



3.1.1. Generalization to held-out subjects (L5S0 cross-validation) 



Best median training and test rates are given for SPDA models fit over the grid of parameters given in 
( 29 1 and the SVM parameters given in ( 32 1 . Despite a more than 1000-fold increase in the number of input 
features relative to previous region of interest analyses fCrosenick et al. 2008 1 , the whole-brain classifiers 

2008). 



performed significantly better than previous ROl-based models ^Knutson et al. (20071; Grosenick et al 
Further, we note that robust models may be more effective on this data, as decreasing 5 in the SPDA-RGN 
(making it more robust, see Figure 2) improved its rates over the GraphNet. This is also consistent with 
the SVM performing well, as the hinge loss employed by the the SVM is robust against outliers. Still, these 
effects are less pronounced than the improvement of the GraphNet methods over penalization alone. Note 
that the lasso and linear SVM tend to overfit the training data more, while the SPDA classifiers have more 
modest training rates. Overall, the best median L5S0 rate was from the robust GraphNet, with accuracy 



18 



Table 2; Median accuracy and parameters for SPDA and SVM classifiers fit with Leave-One-Subject-Out (LOSO) cross- 
validation. 



Model 


Training 


Test 


Sparse 


Correlated 


Structured 


Robust 


Adaptive 


Lasso^ 


90.52% 


68.75% 


Ai = 66 










Elastic Net^ 


90.83% 


70.0% 


Ai = 64 


A2 = 1000 








GraphNet^ (GN) 


87.5% 


71.25% 


Ai = 59 


A2 = 10000 


Xg = 1000 






Robust GN (RGN) 


83.75% 


72.5% 


Ai = 26 


A2 = 1000 


Xg = 1000 


6 = 0.2 




RGN + temporal 


83.75% 


72.5% 


Ai = 35 


A2 = 1000 


Xg = 1000 


6 = 0.3 




Adaptive RGN (ARGN) 


85.4% 


72.5% 


Ai = 20 


A2 = 1000 


Xg = 1000 


6 = 0.2 


XI = 0.01 


ARGN + temporal 


88.33% 


73.75% 


Ai = 30 


A2 = 1000 


Xg = 100 


6 = 0.2 


A* = 0.01 


GraphSVM 


89.48% 


73.75% 


Ai = 88 


A2 = 100 


Xg = 100 


S = 0.5 




Linear SVM" 


91.56% 


68.75% 




C = 7.6 X 10-*^ 









50%. 



on held-out test data of 74.5% (the linear SVM accuracy was 71.0%). Examing the distribution of the rates 
across the 25 folds, Figure 4b shows that the SVM appears to have less variance across fits to held-out 
subjects. The marked non-normality of these distributions is interesting, and motivated our reporting the 
median rather than mean accuracy over cross-validation folds. 

3.1.2. Generalization to held-out individuals (LOSO cross-validation) 

In addition to cross-validation over pooled data across subjects, we also ran leave-one-subject-out (LOSO) 
cross validation (i.e., using the data from 24 subjects to predict results for each remaining subject). Repeating 
this procedure for all subjects yielded one held-out classification rate per subject, indicating how well the 
group model generalized to that subject. Repeating this for all subjects yielded one held-out test rate per 
subject. This rate indicates how well the model fit to all the other subjects generalized to the held-out 
subject — a measure that an approach that may be of interest in studies of individual differences. Figure 4a 
shows smoothed histograms of the LOSO classification rates for the SPDA-RGN and SVM analyses. Overall, 
the GraphNet methods outperform the linear SVM on LOSO cross-validation. Critically, not only are the 
GraphNet methods more performant, but they are also interpretable. 

3.2. Visualization and interpretation of coefficients and parameters 
3.2.1. Interpreting model coefficients 



Although SPDA-RGN and SVM models classified future purchase choices similarly overall, SPDA models 
have the comparative advantage of producing interpretable results. Given the temporal as well as spatial res- 
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Leave-one-subject out CV 



Leave-5-subjects-OLit CV 




I 1 1 i 1 1 I 1 1 1 ' 1 

0.0 0.2 O.-l 0.6 O.S 1.0 Oil 0-2 U-fi OA 1.0 

ClassifWatkMi Aficufacy ClassilicaliOrt ftoouiacy 

Figure 4. Smoothed histogram densities of leave-one-subject out test rates. Models were fit to all subjects except one, and 
then tested on the held-out subject. This was done for all subjects and smoothed histograms of these rates were calculated for 
the Robust GraphNet (5 = 1, blue lines) and linear SVM [C = 4, red lines). 



olution of the present dataset, interpretation might extend not only to which brain regions predict purchasing 
choices, but also when. Thus, an investigator who knows when different events occurred (and accounts for 
the lag in the hemodynamic response) might infer that different events promoted eventual purchasing choices 
by altering activation in specific regions. Applied to the SHOP dataset, SPDA-RGN revealed several novel 
findings (Figure 5). First, consistent with previous reports (Knutson et al. 2007[ [Grosenick et al. 20081, 



nucleus accumbens (NAcc) activation began to positively predict purchase choices at the time of product 
presentation, and this persisted throughout the price period. Medial prefrontal cortex (MPFC) and midbrain 
activation, on the other hand, began to positively predict purchase choices when the price appeared (but not 
during the product presentation period). Additionally, and not included in previous findings, posterior cin- 
gulate activation also began to positively predict purchase choices during price presentation. Reassuringly, 
no regions predicted purchase choices during fixation presentation. Finally, we note that the best models 
chooses far more voxels that positively than negatively predict purchasing. 



3.2.2. Interpreting model parameters 



Figure 6 shows contour plots of median L5S0 cross-validation rates over the values {Ai, Ag,G} (29 1 on 
which we fit the GraphNet and robust GraphNet (results shown are for 5 = 0.3). Overall we find that for 
this particular data set that there is an optimal range of parameters that deliver similar model performance 
and coefficient values. Sparser models such as the lasso do not perform well, as shown by the dotted white 
line in Figure 6a. Comparing the rates in Figure 6 suggests that for this data a certain amount of coefficient 
smoothness and inclusion of correlated variables in the final model is important. 



4. Discussion and Conclusions 



Application to SHOP task data 



With respect to the specific application to prediction of choice, the current methods offer several ad- 
vantages over previous region of interest based analyses. Reassuringly, the GraphNet classifiers delivered 
findings that overlapped with prior region of interest based results, replicating the observation that nucleus 
accumbens (NAcc) activation begins to predict purchase choices during product presentation while medial 
prefrontal cortical (MPFC) activation begins to predict purchase choices during price presentation. It is 
also interesting to note areas that were not included by the models. While one account posits that in the 
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context of fMRI, NAcc activation indexes gain predictions ( Knutson et al.pOOl Knutson and Greer||2008 1, 
an alternative account posits that NAcc activation instead indexes gain prediction errors, while MPFC ac- 
tivation indexes gain predictions (e.g., Hare et ar]|2008 ). To the extent that gain predictions forecast future 
events while gain prediction errors reflect past events, the gain prediction account posits that NAcc activa- 
tion in response to products should predict later purchase choices, whereas the gain prediction error account 
instead posits that MPFC activation in response to products should predict later purchase choices. SPDA- 
RGN results clearly support the gain prediction account, since NAcc activation in response to products 
predict choices to purchase, whereas MPFC activation does not (instead predicting choice in response to 
later presented price information - consistent with a value integration account). SPDA-RGN also revealed 
a previously unnoticed flnding in which posterior cingulate activation clearly appears to predict purchase 
choices at price presentation. Accounts of posterior cingulate function in valuation remain less developed 
than similar accounts of NAcc and MPFC function. Nonetheless, this result might be consistent with at- 



tentional and salience-based accounts of posterior cingulate function McCoy et al. ( 2003 1 and highlights a 
region deserving of future study in the context of predicting choice. 



4-.2. Interpretable models for whole-brain fMRI 



We set out to design and develop a model for fMRI data that was able to yield interpretable results for 
whole-brain data over multiple time-points while yielding accuracy competitive with current state-of-the-art 
methods for large data. In addition we hoped that this model would choose relevant features in a principled 
way-including all relevant features while excluding nuisance parameters-and allow us to flexibly constrain 
model coefhcients using prior information. Further, we hoped the resulting coeflicients would be interpretable 
in the native data space (voxel time series) and that their magnitude would be relatively unbiased despite 
the shrinkage methods employed to yield sparsity. 

The GraphNet-based models presented here are a good first pass at providing these desirable-and at 
times competing-model properties. In particular, the adaptive robust GraphNet allows automatic variable 



selection (and is an adaptation of models that are asymptotically model selection consistent Zou and Zhang 



(2009)), incorporation of prior information in the form of a graph penalty, and yields minimally biased 
coeflicients as a result of adaptive reweighting. These methods can be used in either the regression or 
classification setting (using optimal scoring), and our current application suggests competitive rates with 
state-of-the-art classifiers for high dimensional data. 

Still, this is only one example application, and further validation of performance is necessary. In addition, 
several caveats should be considered. First, the adaptive methods used to "unshrink" model coeflicients may 
be unstable in the small sample case, as we discuss above, so heuristic methods like the one described 
above may be more effective in some circumstances. As the valance and relative magnitude of the per- voxel 
coeflicients may be more important for interpretation than their absolute magnitude in many cases, this 
may not be as important as variable selection consistency, which dictates which voxels are considered for 



interpretation. However, the model selection consistency results established for the elastic net (Jia and Yu 



2009 , De Mol et al. 2009 1 and adaptive elastic net ( Zou and Zhang 2009 ) are recent and must be further 



validated and carefully extended to the GraphNet models. 

Finally, a critic could justly claim that a factor analysis (or other projection onto a lower dimensional 
manifold) could be used to reduce the number of features prior to applying regression or classiflcation methods 
to potentially improve computational speed and results. This would of course depend on the efficiency of 
their dimensionality reduction method, and would not guarantee results interpretable in terms of the original 
data space. However, some previous work suggests that factor analytic methods applied to functional data, 
when combined with a sparse regression method to choose latent factors of interest, can yield satisfactory 
results for very large data (millions of input features), and can be projected back into the original data 
space for interpretation ( [Grosenick et al. In Press, Wager et al. 20111. Our results here suggest that such 
an approach paired with recently developed methods for sparse, structured PCA (Allen et al. Submitted! 
would be particularly useful in fMRI applications. 



4-.3. Future directions 
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There are a variety of graph constraints besides the spatial matrix Laplacian to explore, including (1) 
weighted graphs derived from the data to be adaptive to local smoothness, (2) cut-graphs derived from 
segmented brain atlases that would allow adjacent regions known to be functionally distinct to be penalized 
independently, and (3) weighted graphs derived from diffusion data, which would allow constraints on voxels 
adjacent on an connectivity graph rather than in space or time. In addition, using the goodness-of-fit 
measure provided by the model to make inferences about which of a set of structural graphs best relates 
to functional data, or to adaptively alter graph weights to explore structure in functional data. Finally, 
all of the methods considered above model relationships between input features and the target variable as 
linear. While this approximation works well in many cases, signal saturation effects alone dictate that it be 
false. Recently, nonlinear methods based on scatterplot smoothers have been developed that work well with 
coordinate- wise methods ( Ravikumar et alT]|2009 ) . We intend to explore these along with structured feature 



selection methods to yield more realistic interpretable models for fMRI data. 
Appendix A 

Robust GraphNet: coordinate-wise coejficient updates using infimal convolution 

For a particular coordinate j , we are interested in the estimates 

7, = argmin (1/2)||2; - Z7 - Z.,^,\\l + A2 (7^(G;,^-.).,7, + G'^,lj) + Ai|7j| if J e {1, ...p}, 
7j- = argmin (1/2)11?; - Z7- Z.j7j||2 + (5|7j| if j e {p+ 1, + 

7j 



with updates 



S (Z.jiy - Zj) - (A2/2)7^(G'_,..).„ A1/2) 
7w — ^ if 7 e |1, pj, 

S(z.jiy-Zj), A1/2) 

Ij ^ — if j e {p+l,...,p + n}. 

Z.^ Z.j 



SVM-GraphNet classification: coordinate-wise coefficient updates using infimal convolution 



We can take the same approach with the SVM-GraphNet estimates, which we can write as 

p+n 



P 

7 = argmin (1/2(5) II l„xi - ^7ll2 +A27^oG'7#o + I + XI inax(0, 7^-) 



7 



if j = 

where Z = [y^[l„xi X] /„xn] , 1 = [Po P a], Wj = l\i if j = l,..,p 



1 if j =_p-H l,...,_p + n 



G' 



Oixp Oixn 

Opxl G Oixn 
Onxl 0,1x1 Onxn 



e S[ 



(p+n+l)x(p+n+l) 
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During coordinate wise descent, only one of the separable penalty functions have an "active" variable per 
descent step. Letting /i(7j) = max(0, 7^), we thus have 



fargmin (l/25)||l;vxi - ^7 - Z.j^j\\l 



if j = 



7j 



7j = < 



gmin (1/25)111^x1 - ^7 - Z.j^j\\l + A2 (l^iG'^j-h^j + G'^^j]) + if j e {1, ...,p} 

if j e {p+l,...,p + n}. 



argmin (l/2(5)||lArxi - Zj - Z.^j^\\l + h{-fj) 



Then since 



Z.J = ( 



Ijvxi ifj = 

X.j if j e 

Cj if j e {p+l,...,p + n}, 



we have 



7j ^ < 



where yielding update: 



-ijj^.z., + (z^rz., + ^f,z:^z., if J = 

g((Z7)^g.,-l^xl^-.-(^2/2)7^(G;,,.).,, A1/2) -f ■ ^ . 



(j'^ZlMxl+N{^j-l) 



if i e {p+ l,...,p + n}, 



ifi = 



g((Z^7-livxi)^^.j-(A2/2);3^(G^j.).j, Ai/2) . 



.ff((Z7),-l,<5) 



if i e {p+ l,...,p + n}. 



Algorithm 1 Robust GraphNet update using infimal convolution 

1. Given a set of data and parameters = {X, j/, Ai, A2}, previous coefficient estimates a^'"\^(''\ and 
px p positive semidefinite constraint graph G e S^^, let 



2. Choose coordinate j using essentially cyclic rule and fix 7 = {7^'^^!^ ^ j}, Z = Z.^j,j3 = {li^'\k ^ j}, 



3. Update 7^ using 



7. 



(r+l) 



S(Z.ffa-Z7)-(A2/2)7^(G;,^..).„ Ai/2) .. . . 

z.^z.,+A2G',,- It J i-L,•••,PJ■ 



5((y-Z7),•, A1/2) if J + 

4. Repeat steps (l)-(3) cyclically for all j e {1, . . . ,p + n} until convergence. 
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Algorithm 2 SVM (jrajjliXot classilicat ion ui)(lat(> using inlinial convolution 

1. Given a set of data and parameters = {X,y, Xi, X2}, previous coefficient estimates a''^\ j3Q~\ 
and px p positive semidefinite constraint graph G G S^^, let 

Z = [j/^[lnxl X] /„xn] • 



2. Choose coordinate j using essentially cyclic rule and fix 7 = 7^ j}, Z = Z.^j,/3 = {/3'j^'\k ^ j}, 

X = X.^j . 



3. Update 7^ using 



7. 



(r+l) 



7^Z1„xi+jV(7, -1) ifi = 

S((Z-^7-l„xirx.j-(A2/2)g'^(G^j.).,-, Ai/2) 



4. Repeat (1) -(3) cyclically for all j G {1, . . . ,p + n}. 



Parameter grid used in cross-validation 

Parameters {Ai, G, Ag, 5, Af} were taken over the following grid of values: 

Ai G {10, 11,..., 99} 

G e [L, L + r]I, L + lO^rjI, L + lO^rjI, L + W^rjl} where = 1/Ag for Ac > and 1 otherwise 

Ag e {0, io\ lo^ lo^ 10^, 10^} 

^ e {0.2,0.3,0.4,0.5,0.6,0.7,1,2,10,100} 

At e {i,io-\io-2}. 

The linear SVM was fit over parameters 

c e {lo-^ lo-^ 10"^, lo-^ lo-^, io-\ io-°, 2, 3, 4, 5, e, 7, io\ 10^, 10^}. (32) 
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Robust GraphNet Classifier (L5S0 cross-validation) 
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Adaptive Robust GraphNet Classifier (LOSO cross-validation) 
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Figure 5. Whole-brain classification results from the SHOP task (see Figure 3 for task structure). (Top) Median coefficient 
maps from the best robust GraphNet classifier fit using Leave-5-subject-out (L5SO) cross-validation are shown at two time 
points during each for product period, price period, and choice, period, as well as the fixation period. Warm colored 
coefficients denote areas that predict purchasing a product, while cool-colored areas those that predict not purchasing. The 
areas chosen by the robust GraphNet classifier inc lude the bilateral nucleus accumbens (NAcc ) and the mesial prefrontal 
cortex (MPFC), as suggested by previous studies ( [Knutson et al.|2007[ [Grosenick et al.|2008^ , but also show the VTA and 
posterior cingulate to positively predict purchasing. (Bottom) Himilar plots tor the adaptive robust GraphNet classifier fit 
using leave-one-subject (LOSO) cross-validation. 
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Figure 6. Median test accuracy as a function of model parameters, (a) shows a smoothed image of the median test accuracy 
of the GraphNet SPDA classifier (GN) as functions of hyperparameters Ai and Xq (with A2 = 0). Warm colors indicating 
median classification accuracy above 70% (for L5SO cross-validation) and cool colors median accuracy below 70% (see 
colorbar for scale). The dotted line indicates the standard lasso solution at Xq = 0. There is a clear maxima at 
Ai = 40, Xo = 100. (b) shows similar plots for the GraphNet SPDA classifier (GN) and Robust GraphNet SPDA classifier 
(RGN) at three values of the graph diagonal scale A2. 
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